Processing Wikipedia Dumps - A Case-study Comparing the XGrid and MapReduce Approaches
Abstract
We present a simple comparison of performance, measured as the total execution time taken to parse a 27-GByte XML dump of the English Wikipedia, on three different cluster platforms: Apple’s XGrid, a local Hadoop cluster of Linux workstations, and an Elastic MapReduce cluster rented from Amazon, where Hadoop is the open-source implementation of Google’s MapReduce. We show that for the selected benchmark, XGrid yields the fastest execution time, with the local Hadoop cluster a close second. The overhead of fetching data from Amazon’s Simple Storage Service (S3), along with the inability to skip the reduce, sort, and merge phases on Amazon, penalizes this platform, which is targeted at much larger data sets.
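Skipping the reduce, sort, and merge phases corresponds to running a map-only Hadoop job. The sketch below is only an illustration of that idea, not the paper's code: the class name and the naive line-oriented matching of <title> elements are our own simplifications, and the Hadoop 2.x MapReduce API is assumed. Setting the number of reduce tasks to zero makes Hadoop write mapper output directly to the output directory, with no shuffle stage at all.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical map-only job: extracts page titles from a Wikipedia XML dump.
public class WikiPageTitles {

  // Runs once per input line; emits the title of every <title> element seen.
  // Line-based matching is a simplification of real Wikipedia XML parsing.
  public static class TitleMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String s = line.toString().trim();
      if (s.startsWith("<title>") && s.endsWith("</title>")) {
        String title = s.substring("<title>".length(),
                                   s.length() - "</title>".length());
        context.write(new Text(title), new Text(""));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "wiki-map-only");
    job.setJarByClass(WikiPageTitles.class);
    job.setMapperClass(TitleMapper.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setNumReduceTasks(0);  // map-only: no reduce, sort, or merge phase
    FileInputFormat.addInputPath(job, new Path(args[0]));    // XML dump
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

On a local cluster the input path would point at the dump stored in HDFS; on Elastic MapReduce it would typically be an S3 location, which is where the S3 fetch overhead noted above enters the picture.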
Similar resources
Identification of the Origin and Behaviour of Arsenic in Mine Waste Dumps Using Correlation Analysis: A Case Study Sarcheshmeh Copper Mine
Knowledge of the probable origin and behaviour of arsenic certainly gives valuable insights into the potential for transfer in the environment and of the risks involved in mining sites. Sequential extraction analyses are common experiments often used to study the origin and behaviour of potentially toxic elements. The method, however, presents some deficiencies, including labor-intensive proced...
Cloud Computing Technology Algorithms Capabilities in Managing and Processing Big Data in Business Organizations: MapReduce, Hadoop, Parallel Programming
The objective of this study is to verify the importance of the capabilities of cloud computing services in managing and analyzing big data in business organizations because the rapid development in the use of information technology in general and network technology in particular, has led to the trend of many organizations to make their applications available for use via electronic platforms hos...
Matching Dispute Finder Claims to Wikipedia Articles
Dealing with large datasets is increasingly becoming a problem for natural language processing researchers. For our class project we investigate applying the open-source Hadoop MapReduce framework to the problem of information retrieval using TF-IDF.
Why Not Grab a Free Lunch? Mining Large Corpora for Parallel Sentences to Improve Translation Modeling
It is well known that the output quality of statistical machine translation (SMT) systems increases with more training data. To obtain more parallel text for translation modeling, researchers have turned to the web to mine parallel sentences, but most previous approaches have avoided the difficult problem of pairwise similarity on cross-lingual documents and instead rely on heuristics. In contr...
Implementation and Evaluation of a Framework to calculate Impact Measures for Wikipedia Authors
Wikipedia, an open collaborative website, can be edited by anyone, even anonymously, thus becoming victim to ill-intentioned changes. Therefore, ranking Wikipedia authors by calculating impact measures based on the edit history can help to identify reputational users or harmful activity such as vandalism [4]. However, processing millions of edits on one system can take a long time. The author i...
Publication year: 2011